---
status: proposed
creation-date: "2023-02-08"
authors: [ "@mattkasa", "@jon_jenkins" ]
coach: "@DylanGriffith"
approvers: [ "@rogerwoo", "@alexives" ]
owning-stage: "~devops::data_stores"
participating-stages: []
---

# Automated Query Analysis

## Problem Summary

Our overarching goal is to improve the reliability and throughput of the GitLab
database review process. The current process requires merge request authors to
manually provide query plans and raw SQL when introducing new queries or
updating existing queries. This is both time consuming and error prone.

We believe we can improve operational efficiency by automatically identifying
and analyzing newly introduced SQL queries. This will reduce the risk of human
error, leading to improved system stability and an overall reduction in
performance regressions.

Our key success metric is a reduction in the number of manual actions required
by both code contributors and database reviewers, while maintaining a consistent
standard for database related code contributions.

## Goals

1. Replace the current process of the author manually obtaining SQL and query
   plans with an automated process.
1. Decrease the incidence of performance regressions when poorly performing
   queries are missed by a manual process.
1. Increase contributor and reviewer efficiency by automating the query testing
   portion of database review.

## Challenges

- Capturing the number of SQL queries generated by an application the size of
  `gitlab-org/gitlab` without causing an increase in CI time and/or resources
  may present a challenge.
- Storing the number of SQL queries generated by an application the size of
  `gitlab-org/gitlab` may consume large amounts of database storage.

## Opportunity

- Automated test suites already generate a large number of SQL queries, for
  instance  `rspec` test suites, that can be captured and used to perform
  automated analysis.
- We already utilize `postgres.ai` to analyze query performance, and it has an
  API that will allow us to automate the creation of database clones with
  realistic production-like data in order to perform automated analysis.
- For customers who do not use something like `postgres.ai`, but who are
  connecting to a test database in CI, we would use this connection to generate
  query plans. The accuracy of these query plans will be affected by how
  realistic the test data is, and can be improved by seeding the test database
  with production-like data.
- By storing queries and their query plans, we can tokenize the query plan into
  plan components, assign a cost and weight, then match those against a machine
  learning model. We can build this model by generating query plans for queries
  in our slow query logs, and assign actual cost and weight to their plan
  components. This will allow us to leverage our corpus of queries and slow
  query logs to predict the performance of arbitrary query text for other
  applications and our customers.

## Proposal

We plan to automate the process of identifying new and changed database queries,
so that contributors and reviewers can more accurately and efficiently assess
the database performance impact of a code change.

We will capture queries generated as a side effect of running tests in CI,
normalize them, deduplicate them, analyze them using one or more analyzers, and
store them with their analyses and other metadata for future retrieval and
comparison.

We will post a comment to the originating merge request, containing a summary of
the new and changed queries, with links to their analyses, and highlighting any
queries that exceed established timing or other performance guidelines.

## Design and implementation details

### Iteration 1

In the first iteration we will focus on how we capture queries, including
normalization, deduplication, and storage. We must consider the performance and
resource impacts on CI pipelines during capture, and include things like
partitioning and time decay for the information we are storing.

#### Capturing queries

We will strive to limit the time and resource impacts on our CI pipelines as
much as possible. These are some of the options we will consider for capturing
queries:

- **Instrumenting `ActiveRecord` in `ruby`**
  - Challenges:
    - Only applies to `ruby` projects so it would not be applicable to projects
      like `container-registry`.
    - Has a non-zero impact on time and resources in CI pipelines (these impacts
      can be observed in
      [!111638](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/111638))
  - Opportunities:
    - Simple and straightforward to implement.
    - Allows access to more information (eg. stacktrace and calling locations).
- **Connection proxy with logging**
  - Challenges:
    - Adds complexity and possible performance overhead.
    - Requires maintaining the code for the proxy.
  - Opportunities:
    - Allows us to customize the capture.
    - Allows us to perform normalization/deduplication at capture time.
- **Built-in logging in `postgresql`**
  - Challenges:
    - Require adding a configuration to enable logging.
    - May be difficult to obtain the resulting logs.
  - Opportunities:
    - Doesn't require maintaining any code.
    - Light weight in terms of performance impact.
- **Capture from `pg_stat_statements`**
  - Challenges:
    - Requires creating the extension in the test database.
    - Requires adding a configuration to set `pg_stat_statements.max` to a value
      high enough to capture all queries.
    - Consumes shared memory proportional to `pg_stat_statements.max`.
  - Opportunities:
    - Requires minimal code.
    - Simple to obtain the data.
    - Data is already normalized.

We have already built a proof of concept for instrumenting `ActiveRecord` in
`ruby` in
[!111638](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/111638), so as a
first step we will benchmark the other capture methods against it and select the
best option.

#### Storing queries

For the next step of the first iteration we will use the proof of concept in
[!111638](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/111638) as well
as any data gathered from testing other capture methods to estimate the number
of rows per project, and use the pipeline execution statistics for
`gitlab-org/gitlab` to estimate throughput. These estimates will allow us to
evaluate storage mechanisms that are suitable for our purpose.

Some of the storage mechanisms we plan to evaluate are:

- **In the `ci` database in the GitLab database instance**
  - Challenges:
    - Places additional strain on this resource for `GitLab.com`.
  - Opportunities:
    - Allows us to utilize existing authentication and access control in the form of `CI_JOB_TOKEN`.
    - Allows us to leverage associations with `ci_builds` and `ci_pipelines`.
    - Simplifies deployment for self-managed.
- **In a new decomposed database in the GitLab database instance**
  - Challenges:
    - Adds to required development and testing effort.
    - Adds to deployment effort for `GitLab.com`.
  - Opportunities:
    - Isolates database performance impacts from the existing `main` and `ci` database instances.
- **In a new external service**
  - Challenges:
    - Adds to required development and testing effort.
    - Adds to deployment effort for `GitLab.com` and for self-managed.
  - Opportunities:
    - Isolates performance impacts from `gitlab-org/gitlab`.
    - Allows us to iterate faster without impacting the main application.
- **In ClickHouse**
  - Challenges:
    - Not yet available for self-managed.
  - Opportunities:
    - Isolates database performance impacts from the existing `main` and `ci` database instances.

An example database schema for storing queries:

```sql
CREATE TABLE queries (
    created_at timestamp with time zone NOT NULL,
    updated_at timestamp with time zone NOT NULL,
    id bigint NOT NULL,
    project_id bigint NOT NULL,
    analysis_id bigint,
    hash text,
    sql text
);
CREATE TABLE pipeline_queries (
    id bigint NOT NULL,
    project_id bigint NOT NULL,
    pipeline_id bigint NOT NULL,
    query_id bigint NOT NULL
);
CREATE TABLE analyses (
    created_at timestamp with time zone NOT NULL,
    updated_at timestamp with time zone NOT NULL,
    id bigint NOT NULL,
    project_id bigint NOT NULL,
    query_id bigint NOT NULL,
    buffers int,
    walltime int,
    explain text,
    analysis_url text
);
```

One possible method of partitioning a schema like the above example would be to
utilize
[sub-partitioning](https://github.com/pgpartman/pg_partman/blob/master/doc/pg_partman.md#sub-partitioning).
If we partition by `project_id` then by some interval of `updated_at`, and touch
the row when we see a query, we can store only queries that the codebase is
still executing, and prune partitions that only contain queries the code is no
longer generating.

### Iteration 2

In the second iteration we plan to identify new and changed queries, and post MR
comments containing a summary. We will begin soliciting feedback on the accuracy
and utility of the information, and improve or filter it to maximize it's
usefulness.

### Iteration 3+

In the third and following iterations we plan to automate query analysis using
one or more analyzers, store these analyses, and add them to the MR comments. We
also intend to re-evaluate our use of the database to store query information,
and the API to retrieve it, and potentially move this to an external service.
